Synthetic data is one of those ideas that sounds like it shouldn’t work — if the data is fake, how can a model trained on it generalize to real data? The answer is that “fake” here doesn’t mean arbitrary. Synthetic data generated properly preserves the statistical structure of the original: the marginal distributions of each variable, and the dependence structure between them. If you get that right, the downstream model often can’t tell the difference.
Why bother? Two reasons come up constantly: small samples and privacy constraints. When you don’t have enough data to train a reliable model, generating additional synthetic observations from the learned distribution can help. When your data contains sensitive information that limits how freely it can be shared or used, synthetic data gives you a safe stand-in.
Copulas are the right tool for this. A copula separates two things that are usually bundled together in a multivariate distribution: the marginal distribution of each individual variable, and the dependence structure among them. This separation matters because real data is rarely multivariate normal — the marginals may follow different parametric families, and the correlations in the tails may differ from correlations in the center. Copulas handle both.
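To see the separation in action, here is a minimal from-scratch sketch of the Gaussian-copula recipe using only numpy and scipy, on toy data (a hypothetical example, not the dataset used below): rank-transform each margin to uniforms, map to normals, learn the correlation, then sample and invert the marginals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy data: one right-skewed variable, one roughly normal one, correlated.
x = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
y = 0.8 * np.log(x) + rng.normal(scale=0.3, size=1000)
data = np.column_stack([x, y])
n = len(data)

# 1) Map each margin to uniforms via its empirical CDF (rank transform).
u = stats.rankdata(data, axis=0) / (n + 1)

# 2) Map uniforms to standard normals and estimate their correlation:
#    this is the dependence structure, separated from the marginals.
z = stats.norm.ppf(u)
corr = np.corrcoef(z, rowvar=False)

# 3) Sample new correlated normals, push back to uniforms, then invert
#    each empirical marginal via quantiles.
z_new = rng.multivariate_normal(np.zeros(2), corr, size=n)
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack(
    [np.quantile(data[:, j], u_new[:, j]) for j in range(2)]
)
print(synthetic.shape)  # (1000, 2)
```

The synthetic sample keeps the skewed marginal of `x` (it is built from `x`'s own quantiles) while the correlation is modeled entirely in the transformed normal space. This is essentially what the `copulas` library automates below, with parametric instead of empirical marginals.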
To make this concrete, let's use a few variables from the classic Auto MPG dataset (loaded here as seaborn's "mpg" dataset).
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from copulas.multivariate import GaussianMultivariate
from copulas.univariate import ParametricType, Univariate

df = sns.load_dataset("mpg")
df = df.drop(columns=['origin', 'name'])
df = df.dropna()
df = df[['horsepower', 'weight', 'acceleration', 'mpg']]
df.describe()
            horsepower       weight  acceleration         mpg
count       392.000000   392.000000    392.000000  392.000000
mean        104.469388  2977.584184     15.541327   23.445918
std          38.491160   849.402560      2.758864    7.805007
min          46.000000  1613.000000      8.000000    9.000000
25%          75.000000  2225.250000     13.775000   17.000000
50%          93.500000  2803.500000     15.500000   22.750000
75%         126.000000  3614.750000     17.025000   29.000000
max         230.000000  5140.000000     24.800000   46.600000
Let’s look at the distribution of each variable, the pairwise scatter plots, and the correlations.
A few things stand out. Some pairs of variables are strongly correlated. The distributions differ in shape: acceleration is close to normal, while horsepower and weight have heavier right tails. A Gaussian copula with normal marginals would miss that, which is why we want the copula to handle the marginals separately.
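The "heavier right tail" observation can be quantified with a skewness statistic. A small sketch on toy stand-ins with roughly the shapes described above (assumed distributions, not the actual mpg columns):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Toy stand-ins: a roughly symmetric variable vs. a right-skewed one.
accel_like = rng.normal(15.5, 2.8, size=392)   # close to normal
hp_like = rng.lognormal(4.6, 0.35, size=392)   # heavy right tail

print(stats.skew(accel_like))  # near 0
print(stats.skew(hp_like))     # clearly positive
```

A skewness near zero is consistent with a normal marginal; a clearly positive value signals a right tail that a normal fit would understate.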
Before generating synthetic data, let’s fit a simple linear regression on the original data as a baseline.
y = df.pop('mpg')
X = df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
0.6501833421053667
Now we generate a synthetic training set using a Gaussian copula, where the marginals are estimated separately as parametric distributions (not KDE). The copula model will fit the best-fitting parametric family for each variable, then learn the correlation structure among them.
# Select the best PARAMETRIC univariate (no KDE)
univariate = Univariate(parametric=ParametricType.PARAMETRIC)

def create_synthetic(X, y):
    """
    Combine X and y into a single dataset D, model it with a
    Gaussian copula, and generate a synthetic dataset S.
    Returns the new, synthetic versions of X and y.
    """
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)
    distribs = GaussianMultivariate(distribution=univariate)
    distribs.fit(dataset)
    synthetic = distribs.sample(len(dataset))
    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]
    return X, y, distribs

X_synthetic, y_synthetic, dist = create_synthetic(X_train, y_train)
Let’s inspect what the copula model learned — the fitted distribution for each variable:
The descriptive statistics are close. The pair plot shows the shapes are similar but not identical — the tails in particular may differ somewhat, which is expected given the Gaussian copula assumption on the dependence structure. If the tail behavior mattered critically here, a Clayton or Gumbel copula would be more appropriate.
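For reference, sampling from a bivariate Clayton copula (which has lower-tail dependence) needs only numpy, via the Marshall-Olkin frailty construction. This is an illustrative sketch with an assumed dependence parameter, not a fit to the cars data:

```python
import numpy as np

def sample_clayton(n, theta, rng):
    """Sample n points from a bivariate Clayton copula, theta > 0."""
    # Latent frailty V ~ Gamma(1/theta, 1); its Laplace transform is
    # psi(t) = (1 + t)^(-1/theta), the Clayton generator inverse.
    v = rng.gamma(1.0 / theta, 1.0, size=n)
    # Independent unit exponentials, one per margin.
    e = rng.exponential(size=(n, 2))
    # U_j = psi(E_j / V) gives uniforms with Clayton dependence.
    return (1.0 + e / v[:, None]) ** (-1.0 / theta)

rng = np.random.default_rng(3)
u = sample_clayton(5000, theta=2.0, rng=rng)
print(u.shape)  # (5000, 2)
```

The resulting uniforms cluster in the joint lower tail, the kind of asymmetric dependence a Gaussian copula cannot express; feeding them through the inverse marginal CDFs yields synthetic data with that tail behavior.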
Finally, re-fit the same linear regression on the synthetic training data and evaluate on the original test set:
model = LinearRegression()model.fit(X_synthetic, y_synthetic)print(model.score(X_test, y_test))
0.6236318551606799
The R² is very close to the one from the original data. Not identical — there’s some information loss in the synthetic generation process — but close enough to demonstrate the point: the copula has preserved the statistical structure that the regression model actually needs to make good predictions.
This matters in practice. Synthetic data generated this way isn’t just noise with the right summary statistics; it captures the correlations and distributional shapes that determine model behavior. For augmentation, privacy-preserving data sharing, or testing model robustness, that’s exactly what you need.